Characterizing the community structure of complex networks 
Community structure is one of the key properties of complex networks and plays a crucial role 
in their topology and function. While an impressive amount of work has been done on the issue 
of community detection, very little attention has been so far devoted to the investigation of com- 
munities in real networks. We present a systematic empirical analysis of the statistical properties 
of communities in large information, communication, technological, biological, and social networks. 
We find that the mesoscopic organization of networks of the same category is remarkably similar. 
This is reflected in several characteristics of community structure, which can be used as "finger- 
prints" of specific network categories. While community size distributions are always broad, certain 
categories of networks consist mainly of tree-like communities, while others have denser modules. 
Average path lengths within communities initially grow logarithmically with community size, but 
the growth saturates or slows down for communities larger than a characteristic size. This behaviour 
is related to the presence of hubs within communities, whose roles differ across categories. Also the 
community embeddedness of nodes, measured in terms of the fraction of links within their commu- 
nities, has a characteristic distribution for each category. Our findings are verified by the use of two 
fundamentally different community detection methods. 
I. INTRODUCTION 

The modern science of complex systems has experienced a significant advance after the discovery
that the graph representation of such systems, despite its simplicity, reveals a set of crucial
features that suffice to disclose their general structural properties, function and evolution
mechanisms [1-7]. Representing a complex system as a graph means turning the elementary units of
the system into nodes, while links between nodes indicate their mutual interactions or relations.
Many complex networks are characterized by a broad distribution of the number of neighbors of a
node, i.e. its degree. This is responsible for peculiar properties such as high robustness against
random failures [8] and the absence of a threshold for the spreading of epidemics [9].

Another important feature of complex networks is represented by their mesoscopic structure,
characterized by the presence of groups of nodes, called communities or modules, with a high
density of links between nodes of the same group and a comparatively low density of links
between nodes of different groups [10-13]. This compartmental organization of networks is very
common in systems of diverse origin. It was remarked already in the 1960s that a hierarchical
modular structure is necessary for the robustness and stability of complex systems, and gives
them an evolutionary advantage [14].

Exploring network communities is important for three main reasons: 1) to reveal network
organization at a coarse level, which may help to formulate realistic mechanisms for its genesis
and evolution; 2) to better understand dynamic processes taking place on the network (e.g.,
spreading processes of epidemics and innovation), which may be considerably affected by the
modular structure of the graph; 3) to uncover relationships between the nodes which are not
apparent by inspecting the graph as a whole and which can typically be attributed to the
function of the system.

Therefore it is not surprising that the last years have 
witnessed an explosion of research on community struc- 
ture in graphs. The main problem, of course, is how to 
detect communities in the first place, and this is the es- 
sential issue tackled by most papers on the topic which 
have appeared in the literature. A huge number of meth- 
ods and techniques have been designed, but the scien- 
tific community has not yet agreed on which methods 
are most reliable and when a method should or should 
not be adopted. This is due to the fact that the concept 
of community is ill-defined. Since the focus has been on 
method development, very little has been done so far to 
address a fundamental question of this endeavor: what 
do communities in real networks look like? This is what 
we will try to assess in this paper. 

Previous investigations have shown that across a wide range of networks, the distribution of
community sizes is broad, with many small communities coexisting with some much larger ones
[11, 15-18]. The tail of the distribution can often be fitted quite well by a power law.
Leskovec et al. [19] have carried out a thorough investigation of the quality of communities in
real networks, measured by the conductance score [20]. They found that the lowest conductance,
indicating well-defined modules, is attained for communities of a characteristic size of
~100 nodes, whereas much larger communities are more "mixed" with the rest of the network. For
this reason they suggest that the mesoscopic organization of networks may have a core-periphery
structure, where the periphery consists of small well-defined communities and the core comprises
larger modules, which are more densely connected to each other and therefore harder to detect.
Guimerà and Amaral have proposed a classification of the nodes based on their roles within
communities [21].

However, the fundamental properties of communities 
in real networks are still mostly unknown. Uncover- 
ing such properties is the main goal of this paper. For 
this purpose, we have performed an extensive statistical 
analysis of the community structure of many real net- 
works from nature, society and technology. The main 
conclusion is that communities are characterized by dis- 
tinctive features, which are common for networks of the 
same class but differ from one class to another. Remark- 
ably, such characterization is independent of the specific 
method adopted to find the communities. 



II. DATA AND METHODS 

As our target is to study the statistical features of com- 
munities, we need to employ data sets on large networks 
containing high numbers of communities of varying size. 
Our data sets contain ~ 10^ — 10^ nodes, with excep- 
tion for protein interaction networks (PINs), where the 
largest available data sets are of the order of 10'* nodes. 

Table I lists the network datasets we have used, along with some basic statistics. Most of them
have been downloaded from the Stanford Large Network Dataset Collection
(http://snap.stanford.edu/data/). Some networks are originally directed (e.g., the Web graph),
but we have treated them as undirected. Further details on all networks can be found in the
Appendix.

Overall, we have considered five categories of networks: 

• Communication networks. This class com- 
prises the email network of a large European re- 
search institution, and a set of relationships be- 
tween Wikipedia users communicating via their dis- 
cussion pages. Note that in both cases, communi- 
cation is not necessarily personal but involves, e.g., 
mass emails, and thus these networks cannot be 
considered as social networks. 

• Internet. Here we have two maps of the Internet at the Autonomous Systems (AS) level,
produced by the two main projects exploring the topology of the Internet: CAIDA
(http://www.caida.org/) and DIMES (http://www.netdimes.org/).

• Information networks. This class includes a citation network of online preprints in
www.arxiv.org, a co-purchasing network of items sold by www.amazon.com, and two samples of the
Web graph: one representing the domains berkeley.edu and stanford.edu (Web-BS), the other
released by Google (Web-G).

• Biological networks. This class contains the PINs of three organisms: fruit fly (Drosophila
melanogaster), yeast (Saccharomyces cerevisiae) and man (Homo sapiens).

• Social networks. Here we considered four datasets: a network of friendship relationships
between users of the on-line community LiveJournal (www.livejournal.com); the set of trust
relationships between users of the consumer review site epinions.com; the friendship network of
users of slashdot.org; the friendship network of users of www.last.fm.

The problem of choosing a method for detecting communities is a very delicate one. First, very
efficient algorithms are needed, because the networks we study are large. This requirement rules
out the majority of existing methods. Second, as discussed above, there is no common agreement
on an all-purpose community detection method. This is because of the absence of a shared
definition of community, which is justified by the nature of the problem itself. Consequently,
there is also arbitrariness in defining reliable testing procedures for the algorithms.
Nevertheless, there is a wide consensus on the definition of community originally introduced in
a paper by Condon and Karp [22]. The idea is that a network has communities if the probability
that two nodes of the same community are connected exceeds the probability that nodes of
different communities are connected. This concept of community has been implemented to create
classes of benchmark graphs with communities, such as those introduced by Girvan and Newman [10]
and the graphs recently designed by Lancichinetti et al. [23], which integrate the benchmark by
Girvan and Newman with realistic distributions of degree and community size
Figure 1: Distribution of community sizes. All distributions are broad, and similar for systems
in the same category. Data points are averages within logarithmic bins of the module size s.



Table I: List of the network datasets used for our analysis. For each network we specify the
number of nodes and links, the average and maximum degree.

Category        Name         # nodes      # links       Average degree   Max degree
Communication   wikitalk     2,394,385    4,659,560     3.89             100,029
Communication   email        265,214      364,481       2.75             7,636
Internet        caida        26,475       53,381        4.03             2,628
Internet        dimes        26,211       76,261        5.82             3,988
Information     Web google   875,713      4,322,050     9.87             6,332
Information     arxiv        27,770       352,285       25.37            2,468
Information     amazon       410,236      2,439,440     11.89            2,760
Information     WebBS        685,230      6,649,470     19.41            84,230
Biological      dmela        7,498        22,678        6.05             178
Biological      yeast        1,870        2,203         2.36             56
Biological      human        4,998        21,747        8.70             282
Social          livej        4,846,609    42,851,211    17.68            20,333
Social          epinions     75,879       405,740       10.69            3,044
Social          lastfm       2,647,364    11,245,707    8.49             13,431
Social          slashdot     77,360       469,180       12.13            2,539


(LFR benchmark). Recent work indicates that some algorithms perform very well on the LFR
benchmark [24]. In particular, the Infomap method introduced by Rosvall and Bergstrom [25] has
an outstanding performance, and it is also fast and thus suitable for large networks. However,
as every community detection method has its own "flavor" and preference towards labeling certain
types of structure as communities, relying on a single method is not enough if general
conclusions on community structure are to be presented. Therefore we have cross-checked the
results obtained by Infomap with those produced by a very different algorithm, the Label
Propagation Method (LPM) proposed by Leung et al. [26]. The latter has proven to be reliable on
the LFR benchmark and is also fast enough to handle the largest systems of our collection.
Detailed descriptions of Infomap and the LPM are given in the Appendix. Here we just point out
the profound differences between the two techniques. Infomap is a global optimization method,
which aims to optimize a quality function expressing the code length of an infinitely long
random walk taking place on the graph. The LPM is instead a local method, where each node is
assigned to the community to which most of its neighbors belong. The partitions obtained by both
methods for the same network are in general different. However, the general statistical features
of community structure do not appear to depend much on the details of partitions. In the
following, only Infomap results will be presented; for LPM, see Appendix.
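
The notion of community attributed in this section to Condon and Karp (two nodes of the same
group are linked with higher probability than two nodes of different groups) can be illustrated
with a small random graph generator. This is only a sketch under stated assumptions: the
function name `planted_partition` and the parameters `p_in`, `p_out` and `seed` are illustrative
and do not come from the paper, and the generator is not the LFR benchmark, which additionally
uses heterogeneous distributions of degree and community size.

```python
import random
from itertools import combinations


def planted_partition(n_groups, group_size, p_in, p_out, seed=0):
    """Return (community_of, edges) for a random graph with planted groups.

    Nodes are the integers 0..n-1; node i belongs to group i // group_size.
    A pair of nodes is linked with probability p_in if both nodes are in the
    same group and with probability p_out otherwise (hypothetical parameters).
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    community_of = {node: node // group_size for node in range(n)}
    edges = []
    for u, v in combinations(range(n), 2):
        same_group = community_of[u] == community_of[v]
        if rng.random() < (p_in if same_group else p_out):
            edges.append((u, v))
    return community_of, edges
```

With `p_in` larger than `p_out` the planted groups satisfy the definition above; the closer the
two probabilities, the harder the groups are to recover.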



III. RESULTS 



We begin the analysis by briefly discussing the distribution of community sizes (Fig. 1). We see
that, as expected, for each system there is a wide range of community sizes, spanning several
orders of magnitude for 
Figure 2: Scaled link density of communities as a function of the community size. Communication
and Internet networks consist of essentially tree-like communities, while communities of social
and information networks are much denser. Small modules in biological networks are often
tree-like, while larger modules are denser. Data points are averages within logarithmic bins of
the module size s.




Figure 3: Visualized examples of communities in networks of different classes. Communication
networks (a: email, b: Wiki Talk) contain very sparse communities with star-like hubs. These
hubs give rise to very low shortest path lengths within communities (see Fig. 4). Star-like hubs
are also present in Internet communities (c: DIMES, d: CAIDA), which are relatively sparse as
well. The CAIDA community displays a "merged-star" structure fairly typical for these networks
(see Appendix). On the contrary, information networks contain dense communities up to large
cliques (e: Amazon, f: Web-BS). In biological networks, the larger the community, the less
tree-like it is (g: D. melanogaster, h: H. sapiens). Finally, communities in social networks
appear on average fairly homogeneous (i: Slashdot, j: Epinions).



the largest systems. This is in agreement with earlier studies [11, 15-18]. The overall shapes
of the distributions are similar across systems of the same class. Distributions for biological
networks show the largest differences, which, however, are likely to result from noise as the
networks are smaller. For biological networks, analysis performed with the LPM shows slightly
different, well overlapping distributions (see Appendix).
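
The community size distributions are reported as averages within logarithmic bins of the module
size. The paper does not spell out the binning procedure, so the following is only a minimal
sketch of one common choice: bins of constant ratio, with counts normalized by sample size and
bin width. The function name `log_binned_distribution` and the parameter `bins_per_decade` are
assumptions made for illustration.

```python
import math
from collections import Counter


def log_binned_distribution(sizes, bins_per_decade=4):
    """Return a list of (bin center, probability density) pairs.

    sizes: iterable of positive integers (community sizes).
    Bin b covers [r**b, r**(b + 1)) with r = 10 ** (1 / bins_per_decade);
    the center of a bin is the geometric mean of its two edges.
    """
    sizes = list(sizes)
    # The small offset guards against rounding just below an exact bin edge.
    counts = Counter(
        math.floor(math.log10(s) * bins_per_decade + 1e-9) for s in sizes
    )
    ratio = 10 ** (1.0 / bins_per_decade)
    result = []
    for b in sorted(counts):
        low, high = ratio ** b, ratio ** (b + 1)
        density = counts[b] / (len(sizes) * (high - low))
        result.append((math.sqrt(low * high), density))
    return result
```

Dividing by the bin width keeps sparsely populated bins in the tail comparable with the narrow
bins at small sizes, which is why broad distributions are usually plotted this way.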

Next, we turn to the topology of the communities, and study the link density of communities and
its dependence on community size. The link density of a subgraph is defined as the fraction of
existing links to possible links, $\rho = 2t/[s(s-1)]$, where $t$ is the number of its internal
links and $s$ its size measured in nodes. Here, we use the scaled link density
$\tilde{\rho} = \rho s = 2t/(s-1)$, which also approximately amounts to the average
community-internal degree of nodes in the community. We have chosen this measure since it
clearly points out the nature of subgraphs. For trees, there are always $s-1$ links, and hence
$\tilde{\rho}_{\mathrm{tree}} = 2$. On the other hand, for full cliques $\rho = 1$ and hence
$\tilde{\rho}_{\mathrm{clique}} = s$.
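
As a worked illustration of the scaled link density 2t/(s - 1) defined above, the sketch below
computes it for a community given as a node set and an edge list. The function name
`scaled_link_density` and the edge-list input format are assumptions for illustration only.

```python
def scaled_link_density(nodes, edges):
    """Return 2t / (s - 1) for the subgraph induced by `nodes`.

    nodes: iterable of node identifiers forming one community.
    edges: iterable of (u, v) pairs of the whole undirected network.
    """
    members = set(nodes)
    s = len(members)
    if s < 2:
        raise ValueError("a community needs at least two nodes")
    # Count every internal undirected link once; skip self-loops and duplicates.
    internal = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in members and v in members
    }
    t = len(internal)
    return 2.0 * t / (s - 1)
```

For a path on four nodes (a tree) the function returns 2, and for a clique on four nodes it
returns 4, matching the two limiting cases.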

Fig. 2 displays the average scaled link densities $\tilde{\rho}$ as a function of community size
for different networks. The dashed lines indicate the limiting cases
($\tilde{\rho}_{\mathrm{tree}} = 2$, $\tilde{\rho}_{\mathrm{clique}} = s$). We see that the link
densities in the communication and Internet networks are very close to the lower limit, which
means that their communities are tree-like and contain only few or no loops. In communication
networks, the scaled link density does not depend on community size, whereas in Internet graphs
large communities appear somewhat denser. Networks in these two classes are the sparsest in our
collection, as their very small average degree indicates that they are overall not much denser
than trees (see Table I). It should be noted that in general, the intuitive view on communities
is that they are "dense" compared to the rest of the network. However, as the methods applied
here yield partitions, the communities of a tree-like network are also necessarily tree-like.
Contrary to the above, the much denser information networks reveal a different picture, where
communities are fairly dense objects, with the scaled density increasing with s. Especially in
the Amazon network, communities with s < 10 are almost cliques. Social networks show yet another
pattern: the scaled density of the modules grows quite regularly with the size s, approximately
as a power law. Communities in social networks are mostly far from the two limiting cases: they
are denser than trees, but much sparser than cliques, with the exception of small communities
which appear more tree-like. Finally, the biological networks are characterized by two regimes:
for $s \lesssim 10$, communities are very tree-like; for larger values of s the scaled density
increases with s. In Fig. 3 the characteristic communities of each network class are
illustrated.

The compactness of communities can be measured using the average shortest path length $\ell$
within each community. Fig. 4 displays the average values of $\ell$ as a function of community
size s. For all networks, the average shortest path lengths $\ell$ are very small, $\ell < 3$,
with the exception of social networks. Interestingly, all plots reveal the same basic pattern,
independently of the network class. For very small communities, $\ell$ grows approximately as
the logarithm of the community size (indicated by the dashed line), which is the "small-world"
property typically observed in complex networks [27]. We call these modules microcommunities.
For sizes s of the order of 10, however, the increase of $\ell$ suddenly becomes less
pronounced, and several curves reach a plateau. Modules with more than 10 nodes are
macrocommunities. The stabilization of the average shortest path length in macrocommunities can
be attributed to the presence of nodes with high degree, i.e. hubs, which make geodesic paths on
average short. We remark that, since most of our systems have broad degree distributions,
shortest path lengths are very short [28], but the sharp transition we observe is unexpected and
appears as an entirely novel feature.
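
The average shortest path length within a community can be obtained with one breadth-first
search per node, restricted to the links internal to the community. The sketch below is an
illustration and not the authors' code: the function name `average_path_length` is hypothetical,
and skipping pairs of nodes that are not connected inside the community is an assumption of this
sketch.

```python
from collections import deque


def average_path_length(nodes, edges):
    """Average shortest path length over connected node pairs of a community.

    Only links with both endpoints in `nodes` are used; pairs that cannot
    reach each other inside the community are ignored (an assumption).
    """
    members = set(nodes)
    adjacency = {u: set() for u in members}
    for u, v in edges:
        if u != v and u in members and v in members:
            adjacency[u].add(v)
            adjacency[v].add(u)
    total = 0
    pairs = 0
    for source in members:
        # Breadth-first search gives hop distances from `source`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in distance:
                    distance[w] = distance[u] + 1
                    queue.append(w)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs if pairs else 0.0
```

For a star with one hub and four leaves the result is 1.6, and it approaches 2 as the number of
leaves grows, which is the plateau discussed for star-like communities.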

For communication networks, there is a plateau with $\ell \simeq 2$ for s > 10. As these
communities are tree-like, this indicates that they have a star-like structure where most nodes
are connected to a central hub only and thus their distance equals two. For the Internet
networks, the joint presence of low density and low distances also means that hubs dominate the
structure - here, "merged-star" structures consisting of two or more hubs sharing many of their
neighbors were observed (see Fig. 3). This structure guarantees an efficient communication
between the systems' units. On the contrary, information, social, and biological networks have a
higher density and hence their short path lengths are due to both the density and the presence
of hubs. Hubs play the least dominant role in social networks, as the average shortest path
lengths keep slowly increasing also for large s.

The above picture is further corroborated by Fig. 5, which displays the ratio of the maximal
observed community-internal degree of nodes $\max(k_{in})$ to $s-1$ as a function of the
community size s. This ratio equals unity if any node is connected to all other nodes in its
community, and thus it quantifies the dominance of hubs within communities. For communication
networks, $\max(k_{in})/(s-1)$ is close to unity even for large s, in accordance with the above
observations on star-like communities. For Internet, this quantity somewhat decreases with s, as
communities may contain multiple hubs which do not connect to all other nodes. In information
networks, there are some differences. In the Web graphs, the largest communities contain nodes
connecting (almost) the entire community. As the edge density in these communities is high,
there may be several such nodes - in a clique, all nodes have degree $s-1$. For biological and
social networks, there is a decreasing trend. Especially in social networks, there are few or no
dominant hubs in large communities.
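
The hub dominance ratio discussed above, i.e. the largest community-internal degree divided by
s - 1, can be computed as in the following sketch. The function name `hub_dominance` and the
edge-list input are illustrative assumptions.

```python
def hub_dominance(nodes, edges):
    """Return max(k_in) / (s - 1) for the community formed by `nodes`."""
    members = set(nodes)
    s = len(members)
    if s < 2:
        raise ValueError("a community needs at least two nodes")
    # Sets of internal neighbors, so duplicate links are counted once.
    neighbors = {u: set() for u in members}
    for u, v in edges:
        if u != v and u in members and v in members:
            neighbors[u].add(v)
            neighbors[v].add(u)
    return max(len(nbrs) for nbrs in neighbors.values()) / (s - 1)
```

A star returns 1, since the hub is linked to every other node, whereas a path on four nodes
returns 2/3.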

Figure 4: Average shortest path lengths $\ell$ within communities as a function of community
size s. After an initial logarithmic "small-world" regime (dashed diagonal line), the average
shortest path grows much slower or saturates for communities with s > 10 nodes (dotted vertical
line). Data points are averages within logarithmic bins of module size s.

Figure 5: The maximal observed internal degree of nodes, divided by s - 1, as a function of the
community size s. This quantity equals one if any node is linked to all other nodes of its
community, and thus quantifies the dominance of hubs within communities.

Figure 6: Probability distribution for $k_{in}/k$, the fraction of neighbors of a node belonging
to its own community. Networks in the same class display similar behavior.

Let us next take a closer look at the relationship between individual nodes and community
structure. Here, the most natural property to investigate is the internal degree $k_{in}$,
indicating the number of neighbors of a node in its community. We measure the embeddedness of a
node in its community with the ratio $k_{in}/k$, characterizing the extent to which the node's
neighborhood belongs to the same community as the node itself. The probability distribution of
the embeddedness ratio of all nodes in their respective networks is displayed in Fig. 6. One
would straightforwardly assume that on average, the embeddedness of nodes would be fairly large,
and a substantial fraction of their neighbors should reside inside their respective communities.
However, Fig. 6 shows a more intricate pattern, where smaller values of $k_{in}/k$ are not at
all rare. All of our networks are characterized by a substantial fraction of nodes which are
entirely internal to their communities, i.e. have no links outside their community and thus
$k_{in}/k = 1$. These correspond to the rightmost data points in each plot, and such nodes
typically amount to over 50% of all nodes. These nodes mostly have a low degree (such as the
degree-one nodes connected to hubs in communication networks). Networks in the same class follow
essentially a very similar pattern. Communication networks and the Internet have very
similar-looking profiles, where the distribution has a peak around $k_{in}/k \sim 0.5$.
Information networks, instead, have a rather different profile, with an initial smooth increase
reaching a plateau at about $k_{in}/k \sim 0.4$. The biological networks, despite the inevitable
noise, also show a consistent picture across datasets. They somewhat resemble the communication
and Internet networks, with an initial rise until $k_{in}/k \sim 0.5$, followed by a slow
descent for larger values. Social networks have a rather flat distribution over the whole range,
with little variation from one system to another. This means that there are many nodes with most
of their neighbors outside their own community. Most community detection techniques, including
the ones we have adopted, tend to assign each node to the community which contains the largest
fraction of its neighbors. This implies that if a node has only a few neighbors within its own
community, it will have even fewer neighbors within other individual communities. Such nodes act
as "intermediates" between many different modules, and are shared between many communities
rather than belonging to a single community only. Hence it would be more correct to assign them
to more than one community. Overlapping communities are known to be very common in social
networks, and dedicated techniques for their detection have been introduced [11, 29-32].
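
The embeddedness ratio used in this section, i.e. the fraction of a node's neighbors that lie in
its own community, can be computed from an edge list and a partition as sketched below. The
function name `embeddedness` and the dictionary `community_of` (node to community label) are
assumptions made for illustration.

```python
def embeddedness(edges, community_of):
    """Map every node with at least one link to its ratio k_in / k.

    edges: iterable of (u, v) pairs of an undirected network.
    community_of: dict assigning a community label to every node.
    """
    degree = {}
    internal_degree = {}
    seen = set()
    for u, v in edges:
        link = frozenset((u, v))
        if u == v or link in seen:
            continue  # ignore self-loops and repeated links
        seen.add(link)
        same = community_of[u] == community_of[v]
        for node in (u, v):
            degree[node] = degree.get(node, 0) + 1
            internal_degree[node] = internal_degree.get(node, 0) + (1 if same else 0)
    return {node: internal_degree[node] / degree[node] for node in degree}
```

A histogram of the returned values over all nodes gives the kind of distribution shown in
Fig. 6; nodes with ratio 1 have no links leaving their community.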



IV. DISCUSSION AND CONCLUSIONS 

Since the advent of the science of complex networks, its focus has shifted from understanding
the emergence and importance of system-level characteristics to mesoscopic properties of
networks. These are manifested in communities, i.e. densely connected subgraphs. Communities are
ubiquitous in networks and typically play an important role in the function of a complex system
- modules in protein-interaction networks relate to specific biological functions, and
communities in social networks represent the fundamental level of organization in a society. The
dual problem of formally defining and accurately detecting communities has so far attracted most
of the attention, at the cost of a lack of understanding of the fundamental structural
properties of communities. Our aim in this paper has been to uncover some of these properties.

Our results indicate that communities detected in networks of the same class display
surprisingly similar structural characteristics. This is remarkable, as some classes are really
broad and comprise systems of different origin (e.g. the class of information networks, which
includes graphs of citation, co-purchasing and the Web). The result is verified by two different
community detection methods which are both partition-based but rely on entirely different
principles. In accordance with earlier results, community size distributions are broad for all
systems we have studied. Link densities within communities depend strongly on the network class.
The average shortest path length displays similar behavior across all classes, initially
increasing logarithmically as a function of community size (microcommunities) and then slowing
down or saturating for communities of size $s \gtrsim 10$ (macrocommunities). In combination
with our results on link density in communities, the behavior of path lengths reveals a picture
where high-degree nodes are very dominant in communities of certain classes (communication,
Internet) and play a less important role in the connectivity of others, especially social
networks. This picture is corroborated by the analysis of maximal community-internal degrees of
nodes. Finally, the probability distribution of the fraction of internal links for nodes also
displays a clear signature for each of the considered classes.

The signatures we have found are a sort of network ID, and could be used both to classify other
systems and to identify new network classes. Moreover, they could become essential elements of
network models, with the advantage of more accurate descriptions of real networks and
predictions of their evolution.

Although our results have been obtained using two dif- 
ferent methods, their general validity merits some discus- 
sion. As the concept of "community" is ill-defined, every 
method for detecting communities is based on a specific 
interpretation of the concept. Furthermore, the underly- 
ing philosophies of methods can largely differ. Methods 
requiring that communities are "locally" very dense, such 
as clique percolation [TS], would detect only a few com- 
munities in the communication and Internet networks, 
as they do not consider trees or stars as communities 
- nevertheless, this result would be consistent for net- 
works of the same class. On the other hand, it is evident 
that partition-based methods neglect the fact that nodes 
may participate in multiple communities. However, it 
is worth noting that whichever method is used, the re- 
sulting communities are actual subgraphs of the network 
under study, i.e. its building blocks. Thus their statisti- 
cal properties reflect the mesoscopic organization of net- 
works, and our results indicate that this organization is 
similar within classes of networks. 
Appendix A: Data sets: basic statistics 



[Figure 7 plot. Panels: Communication (Wiki Talk, Email), Internet, Information (Web-G, Arxiv, Web-BS), Biological (Dmela, Yeast, Human), Social (LiveJ, Epinions, Last FM, Slashdot). Horizontal axis: degree k.]


Figure 7: Degree distributions. 
In Fig. 7 we show the degree distributions for all the networks. The degree distribution spans
several orders of magnitude. In Fig. 8 the clustering coefficient [13] of nodes with degree k is
plotted as a function of k. The clustering coefficient of a node is defined as the number t of
links between neighbors of the node divided by the maximum possible number of such links:
$c = t/[\frac{1}{2} k(k-1)]$. As we can see, the shape of the clustering spectrum is basically
the same across all networks, with a rapid decrease of the clustering coefficient with k, except
for the Web graphs, which are known to include very dense subgraphs and cliques, for which the
clustering coefficient can be appreciably high also for nodes of degree ~100. In Fig. 9 we
report the average degree $k_{nn}$ of the neighbors of nodes with degree k, again as a function
of k [33]. Communication networks, the Internet and the Web graphs are clearly disassortative;
the other networks are either moderately disassortative or do not exhibit a particular
correlation. Only the LiveJournal friendship network has an assortative pattern for intermediate
degree values.
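
The two node-level quantities used in this appendix, the clustering coefficient t / [k(k - 1)/2]
and the average degree of a node's neighbors, can be computed from an adjacency structure as in
the sketch below. The function names and the representation of the graph as a dictionary of
neighbor sets are assumptions made for illustration; returning 0 for nodes with fewer than two
neighbors is a convention of this sketch.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Clustering coefficient t / (k (k - 1) / 2) of `node`.

    adjacency: dict mapping each node to the set of its neighbors.
    """
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention chosen here; the ratio is undefined for k < 2
    # t counts the links that exist between pairs of neighbors.
    t = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return t / (0.5 * k * (k - 1))


def average_neighbor_degree(adjacency, node):
    """Mean degree of the neighbors of `node` (0.0 for isolated nodes)."""
    neighbors = adjacency[node]
    if not neighbors:
        return 0.0
    return sum(len(adjacency[u]) for u in neighbors) / len(neighbors)
```

Averaging either quantity over all nodes of the same degree k gives curves of the kind plotted
against degree in Figs. 8 and 9.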



Figure 8: Clustering coefficient versus degree. 



Appendix B: The community detection methods 

In this section, we briefly explain the two commu- 
nity detection algorithms. For a detailed description, the 
reader is referred to the original publications. 

Infomap j25j is based on the idea that a random walker 
exploring the network should get trapped inside dense 
modules for a fairly long time, and cross the boundaries 
of modules only infrequently. This simple idea is for- 
malized by considering the problem of finding the opti- 
mal description of the path of the walker, which can be 
achieved by labelling every node with a prefix given by 
a unique name for the module it belongs to and a suffix 
given by a unique name within its module. The labels of 
nodes, while unique within their module, can be recycled 
in different modules to achieve the most compressed de- 
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Figure 9: Average nearest-neighbor degree fc„„ versus degree. 



Appendix C: Main results from the Label 
Propagation Method 



In order to verify that our results are not due to the 
method alone, but represent real features of the meso- 
scopic organization of the networks, we have carried the 
analysis presented for Infomap in the main paper with 
the Label Propagation method as well. The following 
plots show the characteristics presented in the main pa- 
per obtained via the label propagation method. Results 
are consistent with those obtained with Infomap. 
scription. According to such a two-level description, given 
a partition of the graph, one can compute the amount of 
information needed to describe the path of the walker. If 
the network has a well-defined community structure, the 
code length of the two-level description may be shorter 
than the code length of the one-level description, in which 
each node has a unique name, as the walker will perform 
most of its steps within each module and comparatively 
few between the modules. In this way, the recycling of 
the labels leads to a more compact description of the 
process. The task of Infomap is then to find the par- 
tition which gives the smallest description length. This 
optimization problem is solved using a greedy optimiza- 
tion algorithm in order to obtain the results in reasonable 
time. The use of random walks makes the method nat- 
urally generalizable to the case of directed and weighted 
graphs. For directed graphs, due to the possibility of 
having dangling ends, which are sinks for the diffusion 
process, it is necessary to introduce a teleportation fac- 
tor, similarly to Google's PageRank algorithm [34]. 
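
The comparison between the one-level and the two-level description can be made concrete with a minimal sketch. It assumes an undirected, unweighted graph (so no teleportation is needed) and uses the expanded form of the map equation from the Infomap literature; the function names and the toy graph are illustrative and are not taken from the paper or from the Infomap software.

```python
from math import log2

def plogp(x):
    """Return x * log2(x), with the convention 0 * log2(0) = 0."""
    return x * log2(x) if x > 0 else 0.0

def code_length(edges, module_of):
    """Two-level description length (bits per step) of a random walk on an
    undirected, unweighted graph, for the partition given by module_of."""
    two_m = 2 * len(edges)
    visit = {}      # stationary visit rate of each node: degree / 2m
    exit_rate = {}  # rate at which the walker leaves each module
    for u, v in edges:
        for a in (u, v):
            visit[a] = visit.get(a, 0.0) + 1 / two_m
        if module_of[u] != module_of[v]:
            for a in (u, v):
                mod = module_of[a]
                exit_rate[mod] = exit_rate.get(mod, 0.0) + 1 / two_m
    stay = {}       # total visit rate of the nodes of each module
    for node, p in visit.items():
        stay[module_of[node]] = stay.get(module_of[node], 0.0) + p
    q = sum(exit_rate.values())
    return (plogp(q)
            - 2 * sum(plogp(qi) for qi in exit_rate.values())
            - sum(plogp(p) for p in visit.values())
            + sum(plogp(exit_rate.get(mod, 0.0) + p) for mod, p in stay.items()))

# Toy graph (hypothetical): two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_level = code_length(edges, {n: 0 for n in range(6)})
two_level = code_length(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
print(one_level, two_level)
```

With a single module the expression reduces to the entropy of the node visit rates, which is the one-level description. Infomap itself searches over partitions; this sketch only evaluates the description length of a partition that is given.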

The Label Propagation Method [35] simulates 
the spreading of labels based on the simple rule that at 
each iteration a given node takes the most frequent label 
in its neighborhood. The starting configuration is chosen 
such that every node is given a different label and the 
procedure is iterated until convergence. This method 
tends to produce partitions with very big clusters, 
because a few labels can propagate 
over large portions of the graph. The LPM 
version that we used in our analysis is a modification by 
Leung et al. [55] that handles this problem by introducing 
a hop score which tells how far a certain label is from its 
origin. The hop score is decreased while the label spreads 
through the network and this improves the quality of the 
partitions found by the method. 
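
The basic update rule can be sketched as follows. This is a minimal illustration of plain label propagation only: it does not include the hop score of the modified version used in the analysis, and the asynchronous update order, the tie-breaking rule, the `max_sweeps` cap and the toy graph are assumptions made for the example.

```python
import random
from collections import Counter

def label_propagation(adj, seed=0, max_sweeps=100):
    """Plain label propagation on an undirected graph given as a dict
    mapping each node to the set of its neighbors."""
    rng = random.Random(seed)
    labels = {node: node for node in adj}   # every node starts with its own label
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)                   # asynchronous updates in random order
        changed = False
        for node in nodes:
            if not adj[node]:
                continue                     # isolated nodes keep their label
            counts = Counter(labels[nb] for nb in adj[node])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            if labels[node] in candidates:
                continue                     # already one of the most frequent labels
            labels[node] = rng.choice(candidates)  # ties broken at random
            changed = True
        if not changed:                      # no node changed label: converged
            break
    return labels

# Toy graph (hypothetical): two disconnected triangles.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
       3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = label_propagation(adj)
print(labels)
```

On a connected graph nothing in this rule prevents a single label from taking over a large portion of the nodes, which is the behaviour that the hop score is meant to limit.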
Figure 10: Distribution of community sizes. 



[Plot panels: Communication; Internet; Information (web-G, Arxiv, Amazon, web-BS); Biological (Dmela, Yeast, Human); Social (LiveJ, Epinions, Last FM, Slashdot). Method: LPM. Horizontal axis: module size s.] 


Figure 11: Scaled link density of communities as a function 
of the community size. 
Figure 12: Average shortest path of a community as a func- 
tion of the community size s. 



Figure 14: Distribution of the fraction of neighbors of a node 
belonging to the community of the node. 



Appendix D: Further Statistics on Community 
Properties 

In this section, we want to show some other statis- 
tical properties of the modules. All figures display the 
results obtained using Infomap (upper panel) and the 
Label Propagation Method (lower panel). 

Figure 13: Ratio between the maximum internal degree 
max(k_in) of a node and the maximum possible number of 
internal neighbors s - 1 as a function of s, the module size. 



As only the average values of link densities are shown in 
the main article, we first show what the prob- 
ability distribution of the link density p of communities 
looks like. Fig. 15 shows that in all the systems there 
are dense modules together with sparser modules. Nev- 
ertheless, there is a dependency on the size of the mod- 
ules: Fig. 16 shows what happens if we discard very 
small communities, with fewer than 3 nodes (s < 3), and 
Fig. 17 displays what is left when we consider fairly big 
modules, s > 10; only social and information networks 
include dense modules even after this filtering. 
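
As an illustration of the quantity being filtered here, the sketch below computes the link density of each community and applies a minimum-size threshold. It assumes the usual definition of link density, the number of internal links divided by s(s - 1)/2, which this appendix does not restate; the function name, the `min_size` parameter and the toy partition are hypothetical.

```python
from collections import defaultdict

def link_densities(edges, community_of, min_size=2):
    """Link density of every community with at least min_size nodes,
    assuming density = internal links / (s * (s - 1) / 2)."""
    size = defaultdict(int)
    for c in community_of.values():
        size[c] += 1
    internal = defaultdict(int)
    for u, v in edges:
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1   # link with both ends in the same community
    return {c: 2 * internal[c] / (s * (s - 1))
            for c, s in size.items() if s >= max(min_size, 2)}

# Toy partition (hypothetical): a triangle "A" and a 4-node path "B",
# joined by the single inter-community link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
community_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "B"}
all_densities = link_densities(edges, community_of)
big_only = link_densities(edges, community_of, min_size=4)
print(all_densities, big_only)
```

The triangle is a clique and has density 1, while the tree-like path has density 0.5; raising `min_size` removes the small dense module, which mirrors the effect of the size filters described above.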

Next, we show the average internal clustering coeffi- 
cient as a function of the module size s (Fig. 18). The 
clustering coefficient c is a node property defined as the 
number t of links between neighbors of the node divided 
by the maximum possible number of such links for a node 
with the same degree k: c = t/[k(k - 1)/2]. For nodes 
with degree smaller than two we consider the clustering 
coefficient to be undefined and leave them out of the cal- 
culations of the averages. Here, "internal" means that 
the clustering coefficient is computed by only consider- 
ing the subgraph of the community, which includes only 
the internal links in the community. For communication 
systems and the Internet, the average internal clustering 
coefficient of large communities can reach fairly high val- 
ues although the corresponding densities p are low. This 
can be explained in terms of "merged-star" structures, 
where two (or more) high-degree nodes are connected, 
and their neighbours have a low degree (approximately the number 
of hubs) and are connected to all hubs. Since the clus- 
tering coefficient of these low-degree nodes is typically unity and 
their number is large, they dominate the average cluster- 
ing coefficient within the community. 
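
The merged-star argument can be checked on a small synthetic community. The sketch below builds two connected hubs that share 20 low-degree neighbours (the number 20 and all names are arbitrary choices for the example) and computes the average internal clustering coefficient with the definition given above, leaving out nodes of internal degree smaller than two.

```python
def internal_clustering(adj, members):
    """Average clustering coefficient restricted to the subgraph induced by
    members; nodes with internal degree < 2 are left out of the average."""
    members = set(members)
    values = []
    for node in members:
        nbrs = [nb for nb in adj[node] if nb in members]
        k = len(nbrs)
        if k < 2:
            continue                          # clustering coefficient undefined
        # t = number of links between pairs of neighbors of the node
        t = sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:] if b in adj[a])
        values.append(t / (k * (k - 1) / 2))
    return sum(values) / len(values) if values else float("nan")

# Merged star (hypothetical): two connected hubs, 20 leaves linked to both hubs.
leaves = range(20)
adj = {"h1": {"h2", *leaves}, "h2": {"h1", *leaves}}
for leaf in leaves:
    adj[leaf] = {"h1", "h2"}

avg_c = internal_clustering(adj, adj.keys())
s = len(adj)
n_links = sum(len(nbrs) for nbrs in adj.values()) // 2
density = 2 * n_links / (s * (s - 1))
print(avg_c, density)
```

Each leaf has both hubs as neighbours and the hubs are linked, so every leaf has c = 1, whereas the hubs have a small coefficient. The average is therefore close to one even though fewer than a fifth of the possible links are present.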



Appendix E: Further Statistics of Node Properties 

Here we focus on properties of nodes with respect to 
their communities. Again, we show results obtained us- 
ing both Infomap (upper panel) and the Label Propaga- 
tion Method (lower panel). 

Figure 15: Distribution of the link density for s > 1. 



Figure 16: Distribution of the link density for s > 3. 



In Fig. 19 we show the distribution of the fraction 
of neighbors of a node belonging to its community when 
one only considers nodes with degree k > 3 (in the main 
manuscript we considered all nodes). Fig. 20 is the same 
plot, but including only nodes with degrees larger than 
10. The two plots display flatter curves than the full dis- 
tributions, and they look much smoother, indicating that 
the fluctuations observed in the full curves are mostly due 
to low degree nodes. Low degree nodes can cause peaks 
in the plots because the values of the fraction k_in/k are 
quantized (e.g., for a node of degree two the fraction 
must be 0, 0.5 or 1). By observing the rightmost points 
of the curves, we see that they lie much lower than the 
corresponding points of the full curves, except for the in- 
formation networks. This means that many nodes which 
are fully embedded in their community have low degree, 
as expected. 
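
The quantization of k_in/k and the effect of a degree threshold can be illustrated with a short sketch. The function names, the `min_degree` parameter and the toy graph are assumptions made for the example; exact fractions are used so that the discrete values are visible.

```python
from fractions import Fraction

def embeddedness(adj, community_of, min_degree=1):
    """Fraction k_in / k of each node's neighbors lying in its own community,
    for nodes with degree at least min_degree."""
    result = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k == 0 or k < min_degree:
            continue
        k_in = sum(1 for nb in nbrs if community_of[nb] == community_of[node])
        result[node] = Fraction(k_in, k)
    return result

def possible_values(k):
    """All values that k_in / k can take for a node of degree k."""
    return sorted({Fraction(i, k) for i in range(k + 1)})

# Toy graph (hypothetical): a triangle "A" and a 4-node path "B" joined by link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
adj = {n: set() for n in range(7)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
community_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "B"}

emb_all = embeddedness(adj, community_of)
emb_high = embeddedness(adj, community_of, min_degree=3)
print(possible_values(2), emb_all, emb_high)
```

A degree-two node can only take the values 0, 1/2 and 1, so low-degree nodes concentrate the distribution on a few points; restricting to higher degrees, as in Figs. 19 and 20, removes most of these peaks.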
Figure 17: Distribution of the link density for s > 10. 



Figure 18: Internal clustering coefficient as a function of the 
module size. 
Figure 19: Distribution of k_in/k, for k > 3. 



Figure 20: Distribution of k_in/k, for k > 10. 
Network degree distribution 

Category       name        exponent  min degree  exp error  p-value 
Communication  wikitalk    -2.46     1           0.01 
               email       -2.93     1           0.01 
Internet       caida       -2.12     5           0.03       0.6 
               dimes       -2.2      2           0.01       0.2 
Information    Web Google  -2.68     23          0.01       0.8 
               arxiv       -3.19     61          0.04       0.3 
               amazon      -3.27     17          0.03 
               Web BS      -2.59     46          0.02 
Biological     dmela       -3.50     28          0.05 
               yeast       -3.0      5           0.5        0.4 
               human       3.0       31          0.2        0.1 
Social         live j.     -2.8      86          0.1        0.3 
               epinions    -1.70     1           0.01 
               last fm     -2.9      35          0.1 
               slashdot    -2.5      43          0.5        0.8 

Table II: Power-law exponents of the degree distribution and the minimum degree from which the fit holds. We used maximum 
likelihood fitting [36]. 
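
For orientation, a maximum likelihood estimate of a power-law exponent can be sketched as below. This uses the standard continuous-variable estimator, alpha = 1 + n / sum(ln(x_i / x_min)), with standard error (alpha - 1) / sqrt(n); it is an assumption that this corresponds to the procedure of [36], and the sketch covers neither the selection of the minimum value nor the p-value. Degrees and community sizes are integers, for which this estimator is only an approximation. The tables list the exponent with a minus sign, i.e. -alpha in this notation. All names and the synthetic sample are illustrative.

```python
import math
import random

def fit_exponent(values, x_min):
    """Continuous maximum likelihood estimate of alpha in p(x) ~ x**(-alpha)
    for x >= x_min, with its standard error. Requires at least one value
    strictly above x_min."""
    tail = [x for x in values if x >= x_min]
    n = len(tail)
    alpha = 1 + n / sum(math.log(x / x_min) for x in tail)
    error = (alpha - 1) / math.sqrt(n)
    return alpha, error

# Synthetic check (hypothetical data): draw from a continuous power law with
# alpha = 2.5 by inverse-transform sampling and recover the exponent.
rng = random.Random(0)
alpha_true, x_min = 2.5, 1.0
sample = [x_min * (1 - rng.random()) ** (-1 / (alpha_true - 1))
          for _ in range(20000)]
alpha_hat, alpha_err = fit_exponent(sample, x_min)
print(alpha_hat, alpha_err)
```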



Community size distribution (from Infomap) 

Category       name        exponent  min degree  exp error  p-value 
Communication  wikitalk    -2.7      881         0.3        0.1 
               email       -2.8      674         0.3        0.5 
Internet       caida       -2.10     11          0.05       0.4 
               dimes       -2.00     18          0.05       0.9 
Information    Web Google  -2.57     89          0.03       0.2 
               arxiv       -2.4      69          0.3        0.5 
               amazon      -3.5      97          0.2        0.02 
               Web BS      -2.4      36          0.1 
Biological     dmela       -3.5      9           0.1        0.1 
               yeast       -3.05     8           0.05       0.1 
               human       2.6       8           0.1        0.1 
Social         live j.     -2.22     59          0.02 
               epinions    -2.5      13          0.2        0.3 
               last fm     -2.70     34          0.05 
               slashdot    -3.5      10          0.1 

Table III: Power-law exponents of the community size distribution derived from Infomap. We used maximum likelihood fitting 
[36]. 
Community size distribution (from LPM) 

Category       name        exponent  min degree  exp error  p-value 
Communication  wikitalk    -2.6      1145        0.2        0.4 
               email       -2.4      248         0.1        0.2 
Internet       caida       -2.08     13          0.08       0.4 
               dimes       -1.95     12          0.05       0.8 
Information    Web Google  -2.45     36          0.02       0.1 
               arxiv       -2.0      16          0.1        0.1 
               amazon      -2.80     30          0.05       0.3 
               Web BS      -2.0      107         0.1        0.7 
Biological     dmela       -2.7      10          0.5        0.3 
               yeast       -2.6      5           0.5        0.3 
               human       1.9       2           0.05       0.2 
Social         live j.     -2.40     86          0.05       0.1 
               epinions    -2.40     5           0.05       0.2 
               last fm     -2.9      35          0.1        0.1 
               slashdot    -2.7      24          0.1 

Table IV: Power-law exponents of the community size distribution derived from the LPM. We used maximum likelihood fitting 
[36]. 